Project Kiva¶

The kiva platform provides money to the small buisnesses around the world. This money is lended as a loan and not as a charity. Ideologically anything given for free losses its value.

Kiva partners with microfinance institutions, nonprofits, and other organizations to disburse loans in the communities they serve.

Kiva never takes fees from lenders or borrowers.Kiva relies on donations, grants, and fees from certain lending partners to cover operational costs.

Individual lenders don't earn interest.Kiva aims at improved financial wellbeing for borrowers

We have a data set from kiva containing details about the lended money, buisnesses it was lended, people or buisnesses that invested, so we can analyze what value addition could kiva do to these buisnesses around the world, and how much did the company impact the life of people.

Data Understanding¶

- funded_amount ... mit Ablauf der "Crowding"zeit erhaltener Betrag/ ausgezahlter Betrag in USD
- loan_amount ... Zielbetrag (Betrag dem man für das Projekt erreichen wollte) in USD
- activity ... Unterkategory zu dem das Ziel des Crowdprojektes thematisch gehört
- sector ... Oberkategory in den das Crowdprojektes Thema fällt
- use ... Kurzbeschreibung wofür das Geld verwendet werden soll
- country_code ... Ländercode nach ISO Norm
- country ... Ländername nach ISO Norm
- region ... Region
- currency ... Währung in den der funded_amount dann ausgezahlt wurde
- term in months ... Dauer über die der Kredit ausgezahlt werden soll
- lender_count ...Darlehensgeber (also wieviele Personen Geld für das Projekt gegeben haben)
- borrower_genders ... Geschlecht und Anzahl der Darlehensnehmer, also diejenigen die das Crowdprojekt initiiert haben       
- repayment interval ... vertraglich vereinbarte Rückzahlungsmodalitäten/-rhythmus

The data dictionary has description of every column.

Data Reading¶

In [2]:
##Importing necessary packages
import pandas as pd
import numpy as np

Reading the only first two entries of the data to identify the seperators.

In [3]:
df_input=pd.read_csv('data_abschlussprojekt.csv',sep='/n',engine='python',nrows=2)
df_input
Out[3]:
# funded_amount# loan_amount# activity# sector# use# country_code# country# region# currency# term_in_months# lender_count# borrower_genders# repayment_interval
0 0#300.0#300.0#Fruits & Vegetables#Food#To buy ...
1 1#575.0#575.0#Rickshaw#Transportation#to repai...

lets read the data into a pandas dataframe

In [4]:
df_kiva=pd.read_csv('data_abschlussprojekt.csv',sep='#',engine='python',skipinitialspace=True,index_col=0)
df_kiva.head()
Out[4]:
funded_amount loan_amount activity sector use country_code country region currency term_in_months lender_count borrower_genders repayment_interval
0 300.0 300.0 Fruits & Vegetables Food To buy seasonal, fresh fruits to sell. PK Pakistan Lahore PKR 12.0 12 female irregular
1 575.0 575.0 Rickshaw Transportation to repair and maintain the auto rickshaw used ... PK Pakistan Lahore PKR 11.0 14 female, female irregular
2 150.0 150.0 Transportation Transportation To repair their old cycle-van and buy another ... IN India Maynaguri INR 43.0 6 female bullet
3 200.0 200.0 Embroidery Arts to purchase an embroidery machine and a variet... PK Pakistan Lahore PKR 11.0 8 female irregular
4 400.0 400.0 Milk Sales Food to purchase one buffalo. PK Pakistan Abdul Hakeem PKR 14.0 16 female monthly

Lets get info about the data frame we input from csv file.

In [5]:
df_kiva.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 671205 entries, 0 to 671204
Data columns (total 13 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   funded_amount       671205 non-null  float64
 1   loan_amount         671205 non-null  float64
 2   activity            671205 non-null  object 
 3   sector              671205 non-null  object 
 4   use                 666972 non-null  object 
 5   country_code        671197 non-null  object 
 6   country             671205 non-null  object 
 7   region              614405 non-null  object 
 8   currency            671205 non-null  object 
 9   term_in_months      671205 non-null  float64
 10  lender_count        671205 non-null  int64  
 11  borrower_genders    666984 non-null  object 
 12  repayment_interval  671205 non-null  object 
dtypes: float64(3), int64(1), object(9)
memory usage: 71.7+ MB

Observations

-there are 671205 rows of data, and 13 columns.
-the column 'use','country_code','region',borrower_genders' has null values

Suggestions

-the column currency has data type object that can be changed to category.
-the missing values in 'country_code' and 'country' can be filled by checking corresponding values.

Lets get more info over the data using describe

In [6]:
df_kiva.describe()
Out[6]:
funded_amount loan_amount term_in_months lender_count
count 671205.000000 671205.000000 671205.000000 671205.000000
mean 785.995061 842.397107 13.739022 20.590922
std 1130.398941 1198.660073 8.598919 28.459551
min 0.000000 25.000000 1.000000 0.000000
25% 250.000000 275.000000 8.000000 7.000000
50% 450.000000 500.000000 13.000000 13.000000
75% 900.000000 1000.000000 14.000000 24.000000
max 100000.000000 100000.000000 158.000000 2986.000000

The parmeters in describe are in the appropriate range.There is no unusual observation in the above data.

Lets check the column names for anamoly.

In [7]:
df_kiva.columns
Out[7]:
Index(['funded_amount', 'loan_amount', 'activity', 'sector', 'use',
       'country_code', 'country', 'region', 'currency', 'term_in_months',
       'lender_count', 'borrower_genders', 'repayment_interval'],
      dtype='object')

There are no empty spaces or special signs in the column names

Lets now analyze the unique values in all the columns.

In [8]:
df_kiva.nunique()
Out[8]:
funded_amount            610
loan_amount              479
activity                 163
sector                    15
use                   423452
country_code              86
country                   87
region                 12695
currency                  67
term_in_months           148
lender_count             503
borrower_genders       11298
repayment_interval         4
dtype: int64

Duplicates¶

In [9]:
df_kiva.loc[df_kiva.duplicated()]
Out[9]:
funded_amount loan_amount activity sector use country_code country region currency term_in_months lender_count borrower_genders repayment_interval
498 100.0 100.0 Home Energy Personal Use to buy a solar-powered lamp. SV El Salvador NaN USD 14.0 4 male monthly
606 100.0 100.0 Home Energy Personal Use to buy a solar-powered lamp. SV El Salvador NaN USD 14.0 4 male monthly
808 450.0 450.0 Higher education costs Education to pay for one semester's registration fees. CO Colombia Bogotà COP 7.0 15 female monthly
1703 500.0 500.0 Higher education costs Education To buy a laptop for educational purposes. SO Somalia Hargeisa USD 8.0 19 male monthly
2317 250.0 250.0 Poultry Agriculture to purchase poultry. KE Kenya Ndaragwa KES 16.0 10 female monthly
... ... ... ... ... ... ... ... ... ... ... ... ... ...
671200 0.0 25.0 Livestock Agriculture [True, u'para compara: cemento, arenya y ladri... PY Paraguay Concepción USD 13.0 0 female monthly
671201 25.0 25.0 Livestock Agriculture [True, u'to start a turducken farm.'] - this l... KE Kenya NaN KES 13.0 1 female monthly
671202 0.0 25.0 Games Entertainment NaN KE Kenya NaN KES 13.0 0 NaN monthly
671203 0.0 25.0 Livestock Agriculture [True, u'to start a turducken farm.'] - this l... KE Kenya NaN KES 13.0 0 female monthly
671204 0.0 25.0 Livestock Agriculture [True, u'to start a turducken farm.'] - this l... KE Kenya NaN KES 13.0 0 female monthly

24588 rows × 13 columns

There are 24588 values of duplicate values.

We can select non numerical values or string values as subset to define the duplicates. So we select columns like 'activity', 'sector', 'use','country','region','borrower_gender','repayment_interval'

Lets analyse with activity, sector and use first.

In [10]:
df_kiva.loc[df_kiva.duplicated(subset=['activity','sector','use']),:]
Out[10]:
funded_amount loan_amount activity sector use country_code country region currency term_in_months lender_count borrower_genders repayment_interval
27 475.0 475.0 Rickshaw Transportation to cover the cost of repairing rickshaw PK Pakistan Lahore PKR 12.0 17 female irregular
47 250.0 250.0 Farming Agriculture to purchase potato seeds and fertilizers for g... IN India Dhupguri INR 43.0 10 female bullet
142 4750.0 4750.0 Personal Housing Expenses Housing to pay for the construction of a hygienic toil... VN Vietnam Hai Duong VND 14.0 162 female, female, female, female, female, female... bullet
154 1325.0 1325.0 Food Production/Sales Food to buy shea nuts. ML Mali Kadiolo XOF 7.0 50 female, female, female, female, female monthly
199 2850.0 2850.0 Personal Housing Expenses Housing to pay for the construction of a hygienic toil... VN Vietnam Hai Duong VND 14.0 86 female, female, female, female, female, female bullet
... ... ... ... ... ... ... ... ... ... ... ... ... ...
671200 0.0 25.0 Livestock Agriculture [True, u'para compara: cemento, arenya y ladri... PY Paraguay Concepción USD 13.0 0 female monthly
671201 25.0 25.0 Livestock Agriculture [True, u'to start a turducken farm.'] - this l... KE Kenya NaN KES 13.0 1 female monthly
671202 0.0 25.0 Games Entertainment NaN KE Kenya NaN KES 13.0 0 NaN monthly
671203 0.0 25.0 Livestock Agriculture [True, u'to start a turducken farm.'] - this l... KE Kenya NaN KES 13.0 0 female monthly
671204 0.0 25.0 Livestock Agriculture [True, u'to start a turducken farm.'] - this l... KE Kenya NaN KES 13.0 0 female monthly

229263 rows × 13 columns

There are 24588 values of duplicates in the data , and some of the rows are identical to each other. so we use drop_duplicates to deal with the inconsistency in the data.

In [11]:
df_kiva.drop_duplicates(inplace=True)
df_kiva
Out[11]:
funded_amount loan_amount activity sector use country_code country region currency term_in_months lender_count borrower_genders repayment_interval
0 300.0 300.0 Fruits & Vegetables Food To buy seasonal, fresh fruits to sell. PK Pakistan Lahore PKR 12.0 12 female irregular
1 575.0 575.0 Rickshaw Transportation to repair and maintain the auto rickshaw used ... PK Pakistan Lahore PKR 11.0 14 female, female irregular
2 150.0 150.0 Transportation Transportation To repair their old cycle-van and buy another ... IN India Maynaguri INR 43.0 6 female bullet
3 200.0 200.0 Embroidery Arts to purchase an embroidery machine and a variet... PK Pakistan Lahore PKR 11.0 8 female irregular
4 400.0 400.0 Milk Sales Food to purchase one buffalo. PK Pakistan Abdul Hakeem PKR 14.0 16 female monthly
... ... ... ... ... ... ... ... ... ... ... ... ... ...
671179 0.0 25.0 Livestock Agriculture [True, u'para compara: cemento, arenya y ladri... PY Paraguay Concepción USD 13.0 0 female monthly
671181 0.0 25.0 Livestock Agriculture Reviewed loan use in english. PK Pakistan Attock PKR 13.0 0 female monthly
671182 0.0 125.0 Livestock Agriculture Pretend the flagged issue was addressed by KC. MX Mexico Iztacalco MXN 13.0 0 female, female monthly
671184 0.0 875.0 Livestock Agriculture Translated loan use to english. BO Bolivia La Paz BOB 13.0 0 female, female monthly
671188 0.0 250.0 Livestock Agriculture Reviewed loan use in english. GH Ghana Dansoman GHS 13.0 0 female monthly

646617 rows × 13 columns

671205 total values

24588 duplicate values

646617 after removing duplicates

Missing values¶

In [12]:
df_kiva.isnull().sum()
Out[12]:
funded_amount             0
loan_amount               0
activity                  0
sector                    0
use                    3900
country_code              8
country                   0
region                56158
currency                  0
term_in_months            0
lender_count              0
borrower_genders       3888
repayment_interval        0
dtype: int64
In [13]:
df_kiva.columns
Out[13]:
Index(['funded_amount', 'loan_amount', 'activity', 'sector', 'use',
       'country_code', 'country', 'region', 'currency', 'term_in_months',
       'lender_count', 'borrower_genders', 'repayment_interval'],
      dtype='object')
In [14]:
df_kiva.dtypes
Out[14]:
funded_amount         float64
loan_amount           float64
activity               object
sector                 object
use                    object
country_code           object
country                object
region                 object
currency               object
term_in_months        float64
lender_count            int64
borrower_genders       object
repayment_interval     object
dtype: object

Column country code¶

There are 8 missing values in country code , lets analyse if we can subsitute those values based on the country.

In [15]:
df_kiva.loc[df_kiva['country_code'].isnull(),['country_code','country']]
Out[15]:
country_code country
202537 NaN Namibia
202823 NaN Namibia
344929 NaN Namibia
351177 NaN Namibia
420953 NaN Namibia
421218 NaN Namibia
487207 NaN Namibia
487653 NaN Namibia

The country code for Namibia as per ISO norm is NA.

https://www.iso.org/obp/ui/#iso:code:3166:NA Lets substitute NaN in country code for country Namibia as NA

In [16]:
df_kiva.loc[df_kiva['country_code'].isnull(),['country_code']]='NA'

###check if replacement occured
df_kiva.loc[df_kiva['country']=='Namibia',['country_code','country']] 
Out[16]:
country_code country
202537 NA Namibia
202823 NA Namibia
344929 NA Namibia
351177 NA Namibia
420953 NA Namibia
421218 NA Namibia
487207 NA Namibia
487653 NA Namibia
In [17]:
df_kiva['repayment_interval'].unique()
Out[17]:
array(['irregular', 'bullet', 'monthly', 'weekly'], dtype=object)

Column borrower genders¶

In [18]:
borrower_array=df_kiva['borrower_genders'].unique()
borrower_array
Out[18]:
array(['female', 'female, female', 'female, female, female', ...,
       'female, female, male, female, female, female, female, female, female, female, male, male, female, female, male, female, female, female, female, female, female, female',
       'male, female, female, female, female, female, female, female, male, male, female, male, female, male, male, male',
       'female, female, female, male, female, female, female, male, female, female, female, male, female, male, female, female, female, female, female, female, female, female, female, female, female, female, female, female, male'],
      dtype=object)
In [19]:
df_kiva.loc[df_kiva['borrower_genders'].isnull(),:]
Out[19]:
funded_amount loan_amount activity sector use country_code country region currency term_in_months lender_count borrower_genders repayment_interval
140 2975.0 2975.0 Food Production/Sales Food NaN TZ Tanzania NaN TZS 10.0 110 NaN monthly
145 1200.0 1200.0 Personal Expenses Personal Use NaN PE Peru NaN PEN 20.0 44 NaN monthly
170 4250.0 4250.0 Catering Food NaN TZ Tanzania NaN TZS 10.0 116 NaN monthly
412 2350.0 2350.0 Beauty Salon Services NaN TZ Tanzania NaN TZS 10.0 75 NaN monthly
414 725.0 725.0 Agriculture Agriculture NaN SV El Salvador NaN USD 20.0 19 NaN monthly
... ... ... ... ... ... ... ... ... ... ... ... ... ...
659604 5625.0 10000.0 Weaving Arts NaN BT Bhutan NaN USD 14.0 210 NaN irregular
660788 1975.0 1975.0 Home Energy Personal Use NaN PS Palestine NaN USD 27.0 39 NaN monthly
661718 800.0 1600.0 Furniture Making Manufacturing NaN HT Haiti NaN HTG 13.0 27 NaN irregular
671151 0.0 25.0 Livestock Agriculture NaN KE Kenya NaN KES 13.0 0 NaN monthly
671174 0.0 25.0 Games Entertainment NaN KE Kenya NaN KES 13.0 0 NaN monthly

3888 rows × 13 columns

In [20]:
## Function to identify the special signs in the column borrower_genders
signs=[]
for x in borrower_array:
    try:
        int(x)
    except:
        signs.append(x)
#signs
In [21]:
df_kiva['borrower_genders']=df_kiva['borrower_genders'].astype('string')
df_kiva['borrower_genders'].dtypes
df_kiva.loc[:,'borrower_genders']
Out[21]:
0                 female
1         female, female
2                 female
3                 female
4                 female
               ...      
671179            female
671181            female
671182    female, female
671184    female, female
671188            female
Name: borrower_genders, Length: 646617, dtype: string
In [22]:
list_count_fnm=[]
for x in df_kiva.loc[:,'borrower_genders']:
    try:
        borrower=x.split(',')
        borrower_count=[]
        count_female = 0
        count_male = 0
        for i in range(0,len(borrower)):
        # if match found increase count
            if (borrower[i] == 'female'):
                   count_female = count_female + 1
            elif (borrower[i] =='male'):
                    count_male = count_male + 1
        lender_count=[count_female,count_male]           
    except:
        lender_count=[x]  
    list_count_fnm.append(lender_count)
#list_count_fnm

From above preprocessing we observe that the word female and male is repeated many times in each row.

Here we try to find out the no of times the words male and female are repeated in the column.

This data will be useful to find out the percentage of female borrowers and male borrowers.

In [23]:
list_count_fnm=[]

for x in df_kiva.loc[:,'borrower_genders']:
    borrower_count=[]
    try:
        count_female=x.count('female')
        count_male= x.count('male')
        borrower_count=[count_female,count_male]
        
    except:
        borrower_count=x
    list_count_fnm.append(borrower_count) 
#len(list_count_fnm)
#list_count_fnm
In [24]:
import re

list_count_fnm=[]

for x in df_kiva.loc[:,'borrower_genders']:
    borrower_count=[]
    try:
        count_female = len(re.findall(r'\bfemale\b', x))
        count_male = len(re.findall(r'\bmale\b', x))
        borrower_count=[count_female,count_male]
        
    except:
        borrower_count=x
    list_count_fnm.append(borrower_count)

#list_count_fnm
#len(list_count_fnm)

we have now obtained a list of female borrowers and male borrowers , now we can split this list as female borrowers and male borrowers.

Further we can append this as a column in the main dataframe df_kiva

In [25]:
## female borrower count

borrower_female=[]
for i in list_count_fnm:
    try:
        borrower_female.append(i[0])
        
    except:
        borrower_female.append(i)
len(borrower_female)
#borrower_female
Out[25]:
646617
In [26]:
## male borrower count

borrower_male=[]

for i in list_count_fnm:
    try:
        borrower_male.append(i[1])
        
    except:
        borrower_male.append(i)

#borrower_male
len(borrower_male)
Out[26]:
646617

We can now add 2 columns as borrower_female and borrower_male instead of borrower_genders

In [27]:
df_kiva['borrower_female']=borrower_female
#df_kiva['borrower_female']=df_kiva['borrower_female'].astype(int) ### has NA type conversion doesnot happen

#df_kiva
In [28]:
df_kiva['borrower_male']=borrower_male
#df_kiva
In [29]:
df_kiva_v1=df_kiva.drop('borrower_genders',axis=1)
df_kiva_v1
Out[29]:
funded_amount loan_amount activity sector use country_code country region currency term_in_months lender_count repayment_interval borrower_female borrower_male
0 300.0 300.0 Fruits & Vegetables Food To buy seasonal, fresh fruits to sell. PK Pakistan Lahore PKR 12.0 12 irregular 1 0
1 575.0 575.0 Rickshaw Transportation to repair and maintain the auto rickshaw used ... PK Pakistan Lahore PKR 11.0 14 irregular 2 0
2 150.0 150.0 Transportation Transportation To repair their old cycle-van and buy another ... IN India Maynaguri INR 43.0 6 bullet 1 0
3 200.0 200.0 Embroidery Arts to purchase an embroidery machine and a variet... PK Pakistan Lahore PKR 11.0 8 irregular 1 0
4 400.0 400.0 Milk Sales Food to purchase one buffalo. PK Pakistan Abdul Hakeem PKR 14.0 16 monthly 1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
671179 0.0 25.0 Livestock Agriculture [True, u'para compara: cemento, arenya y ladri... PY Paraguay Concepción USD 13.0 0 monthly 1 0
671181 0.0 25.0 Livestock Agriculture Reviewed loan use in english. PK Pakistan Attock PKR 13.0 0 monthly 1 0
671182 0.0 125.0 Livestock Agriculture Pretend the flagged issue was addressed by KC. MX Mexico Iztacalco MXN 13.0 0 monthly 2 0
671184 0.0 875.0 Livestock Agriculture Translated loan use to english. BO Bolivia La Paz BOB 13.0 0 monthly 2 0
671188 0.0 250.0 Livestock Agriculture Reviewed loan use in english. GH Ghana Dansoman GHS 13.0 0 monthly 1 0

646617 rows × 14 columns

In [30]:
df_kiva_v1.dtypes
Out[30]:
funded_amount         float64
loan_amount           float64
activity               object
sector                 object
use                    object
country_code           object
country                object
region                 object
currency               object
term_in_months        float64
lender_count            int64
repayment_interval     object
borrower_female        object
borrower_male          object
dtype: object
In [31]:
df_kiva_v1['borrower_female'].min() ### output is zero , so we check corresponding value in male borrower column
Out[31]:
0
In [32]:
df_kiva_v1.loc[df_kiva_v1.loc[:,'borrower_female']==0,'borrower_male']
Out[32]:
49        1
50        5
53        1
67        1
69        1
         ..
671136    1
671137    1
671143    1
671145    1
671147    1
Name: borrower_male, Length: 133307, dtype: object

It has 133307 values now we will see if both male and female values are zero at the same time

In [33]:
df_kiva_v1.loc[(df_kiva_v1.loc[:,'borrower_female']==0) & (df_kiva_v1.loc[:,'borrower_male']==0),:]
Out[33]:
funded_amount loan_amount activity sector use country_code country region currency term_in_months lender_count repayment_interval borrower_female borrower_male

There are no such values so we can check the OR condition now

In [34]:
df_kiva_v1.loc[(df_kiva_v1.loc[:,'borrower_female']==0) | (df_kiva_v1.loc[:,'borrower_male']==0),:]
Out[34]:
funded_amount loan_amount activity sector use country_code country region currency term_in_months lender_count repayment_interval borrower_female borrower_male
0 300.0 300.0 Fruits & Vegetables Food To buy seasonal, fresh fruits to sell. PK Pakistan Lahore PKR 12.0 12 irregular 1 0
1 575.0 575.0 Rickshaw Transportation to repair and maintain the auto rickshaw used ... PK Pakistan Lahore PKR 11.0 14 irregular 2 0
2 150.0 150.0 Transportation Transportation To repair their old cycle-van and buy another ... IN India Maynaguri INR 43.0 6 bullet 1 0
3 200.0 200.0 Embroidery Arts to purchase an embroidery machine and a variet... PK Pakistan Lahore PKR 11.0 8 irregular 1 0
4 400.0 400.0 Milk Sales Food to purchase one buffalo. PK Pakistan Abdul Hakeem PKR 14.0 16 monthly 1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
671179 0.0 25.0 Livestock Agriculture [True, u'para compara: cemento, arenya y ladri... PY Paraguay Concepción USD 13.0 0 monthly 1 0
671181 0.0 25.0 Livestock Agriculture Reviewed loan use in english. PK Pakistan Attock PKR 13.0 0 monthly 1 0
671182 0.0 125.0 Livestock Agriculture Pretend the flagged issue was addressed by KC. MX Mexico Iztacalco MXN 13.0 0 monthly 2 0
671184 0.0 875.0 Livestock Agriculture Translated loan use to english. BO Bolivia La Paz BOB 13.0 0 monthly 2 0
671188 0.0 250.0 Livestock Agriculture Reviewed loan use in english. GH Ghana Dansoman GHS 13.0 0 monthly 1 0

604210 rows × 14 columns

Most of the values have either female or male value zero.

We can also check this by group by and aggregation

female & male 0 0 ----> zero cases

0 OR 0 ----> 604210 cases

0 & 1 ----> 129726 cases

1 & 0 ----> 415152 cases

1 & 1 ----> 2475 cases

In [35]:
df_kiva_v1.loc[(df_kiva_v1.loc[:,'borrower_female']==0) & (df_kiva_v1.loc[:,'borrower_male']==1),:]
Out[35]:
funded_amount loan_amount activity sector use country_code country region currency term_in_months lender_count repayment_interval borrower_female borrower_male
49 450.0 450.0 General Store Retail to stock his store. SV El Salvador NaN USD 14.0 18 monthly 0 1
53 550.0 550.0 Personal Expenses Personal Use to buy a cradle and household items for his yo... PE Peru Ica PEN 14.0 20 monthly 0 1
67 125.0 125.0 Energy Services purchase solar lanterns for resale. KE Kenya NaN KES 3.0 6 irregular 0 1
69 1075.0 1075.0 Transportation Transportation to pay for general maintenance on his vehicle. PE Peru Ica PEN 17.0 41 monthly 0 1
70 2000.0 2000.0 Retail Retail to install a display window and a sunshade for... IQ Iraq NaN USD 15.0 71 monthly 0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
671136 0.0 350.0 Personal Medical Expenses Health to buy medicine to start treatment. TJ Tajikistan Qubadiyon TJS 14.0 0 monthly 0 1
671137 50.0 1150.0 Higher education costs Education to pay his nephews' tuition, so that they can ... TJ Tajikistan Qurgan-Tube TJS 14.0 2 monthly 0 1
671143 125.0 1175.0 Farming Agriculture to purchase seeds and fertilizer for his green... UA Ukraine Gladkovka UAH 14.0 5 monthly 0 1
671145 175.0 775.0 Flowers Agriculture to purchase seedlings, pots for flowers and fe... UA Ukraine Vinogradovo UAH 12.0 7 monthly 0 1
671147 0.0 250.0 Food Stall Food to buy watermelon to start a new business. TJ Tajikistan Spitamen TJS 14.0 0 monthly 0 1

129726 rows × 14 columns

In [36]:
df_kiva_v1.loc[(df_kiva_v1.loc[:,'borrower_female']==1) & (df_kiva_v1.loc[:,'borrower_male']==0),:]
Out[36]:
funded_amount loan_amount activity sector use country_code country region currency term_in_months lender_count repayment_interval borrower_female borrower_male
0 300.0 300.0 Fruits & Vegetables Food To buy seasonal, fresh fruits to sell. PK Pakistan Lahore PKR 12.0 12 irregular 1 0
2 150.0 150.0 Transportation Transportation To repair their old cycle-van and buy another ... IN India Maynaguri INR 43.0 6 bullet 1 0
3 200.0 200.0 Embroidery Arts to purchase an embroidery machine and a variet... PK Pakistan Lahore PKR 11.0 8 irregular 1 0
4 400.0 400.0 Milk Sales Food to purchase one buffalo. PK Pakistan Abdul Hakeem PKR 14.0 16 monthly 1 0
5 250.0 250.0 Services Services purchase leather for my business using ksh 20000. KE Kenya NaN KES 4.0 6 irregular 1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
671166 25.0 25.0 Livestock Agriculture [True, u'to start a turducken farm.'] - this l... KE Kenya NaN KES 13.0 1 monthly 1 0
671168 0.0 25.0 Livestock Agriculture Pretend the issue with loan got addressed by K... KE Kenya NaN KES 13.0 0 monthly 1 0
671179 0.0 25.0 Livestock Agriculture [True, u'para compara: cemento, arenya y ladri... PY Paraguay Concepción USD 13.0 0 monthly 1 0
671181 0.0 25.0 Livestock Agriculture Reviewed loan use in english. PK Pakistan Attock PKR 13.0 0 monthly 1 0
671188 0.0 250.0 Livestock Agriculture Reviewed loan use in english. GH Ghana Dansoman GHS 13.0 0 monthly 1 0

415152 rows × 14 columns

In [37]:
df_kiva_v1.loc[(df_kiva_v1.loc[:,'borrower_female']==1) & (df_kiva_v1.loc[:,'borrower_male']==1),:]
Out[37]:
funded_amount loan_amount activity sector use country_code country region currency term_in_months lender_count repayment_interval borrower_female borrower_male
4854 625.0 625.0 Farming Agriculture to harvesting the paddy rice. KH Cambodia Battambang province, Moung Russey district KHR 14.0 24 monthly 1 1
5026 775.0 775.0 Auto Repair Services to buy spare parts for a mototaxi GT Guatemala San Martin Jilotepeque, Chimaltenango GTQ 14.0 22 monthly 1 1
7112 200.0 200.0 Farming Agriculture to purchase hybrid seeds and fertilizer to gro... KE Kenya Kuria west KES 10.0 8 bullet 1 1
7906 100.0 100.0 Home Appliances Personal Use To build a decent latrine for his household KH Cambodia Kandal province, Kandal Stoeng district KHR 14.0 1 monthly 1 1
7942 725.0 725.0 Retail Retail to expand his business on a larger scale. KH Cambodia Battambang province, Sangke district KHR 14.0 15 monthly 1 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
669199 0.0 1000.0 Farming Agriculture to purchase rice seeds for farming KH Cambodia Svay Reang Province KHR 20.0 0 monthly 1 1
669818 50.0 750.0 Agriculture Agriculture to buy fertilizer, oil and seeds for growing r... KH Cambodia Kratie Province, Prey Prorsorb District KHR 19.0 2 monthly 1 1
669835 0.0 1000.0 Fishing Food to buy and assemble a gill net and buy a motor CD The Democratic Republic of the Congo IDJWI USD 11.0 0 monthly 1 1
669903 25.0 700.0 Fishing Food to purchase a canoe for his fishing business a... CD The Democratic Republic of the Congo IDJWI USD 9.0 1 monthly 1 1
670860 0.0 800.0 Farming Agriculture to maintain his pineapple field and to buy pin... CD The Democratic Republic of the Congo IDJWI USD 14.0 0 monthly 1 1

2475 rows × 14 columns

In [38]:
#df_kiva_v1.groupby(by=['borrower_female','borrower_male']).agg(countf=('borrower_female','count'),countm=('borrower_male','count'))

There are minimum of 1 and 1 value in both female and male column.

there are 3888 missing values in both columns.

We can replace these missing values with 1 as there has to be atleast 1 borrower either its female or male, the kiva website says they have more female borrowers.

To be fair in our analysis we can replace missing values with 1 in both columns.

In [39]:
df_kiva_v1['borrower_female'].fillna(1, inplace=True)
df_kiva_v1.loc[df_kiva_v1.loc[:,'borrower_female'].isnull()]
Out[39]:
funded_amount loan_amount activity sector use country_code country region currency term_in_months lender_count repayment_interval borrower_female borrower_male
In [40]:
df_kiva_v1['borrower_male'].fillna(1, inplace=True)
df_kiva_v1.loc[df_kiva_v1.loc[:,'borrower_male'].isnull()]
Out[40]:
funded_amount loan_amount activity sector use country_code country region currency term_in_months lender_count repayment_interval borrower_female borrower_male
In [41]:
df_kiva_v1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 646617 entries, 0 to 671188
Data columns (total 14 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   funded_amount       646617 non-null  float64
 1   loan_amount         646617 non-null  float64
 2   activity            646617 non-null  object 
 3   sector              646617 non-null  object 
 4   use                 642717 non-null  object 
 5   country_code        646617 non-null  object 
 6   country             646617 non-null  object 
 7   region              590459 non-null  object 
 8   currency            646617 non-null  object 
 9   term_in_months      646617 non-null  float64
 10  lender_count        646617 non-null  int64  
 11  repayment_interval  646617 non-null  object 
 12  borrower_female     646617 non-null  int64  
 13  borrower_male       646617 non-null  int64  
dtypes: float64(3), int64(3), object(8)
memory usage: 74.0+ MB
In [42]:
df_kiva_v1.isnull().sum()
Out[42]:
funded_amount             0
loan_amount               0
activity                  0
sector                    0
use                    3900
country_code              0
country                   0
region                56158
currency                  0
term_in_months            0
lender_count              0
repayment_interval        0
borrower_female           0
borrower_male             0
dtype: int64

Column repayment_interval¶

In [43]:
df_kiva_v1['repayment_interval'].unique()
Out[43]:
array(['irregular', 'bullet', 'monthly', 'weekly'], dtype=object)

This column has only four unique values and it has a data type object. we can convert this data type to category to save memory usauge

In [44]:
df_kiva_v1.loc[:,'repayment_interval'] = df_kiva_v1.loc[:,'repayment_interval'].astype('category')
In [45]:
df_kiva_v1.dtypes
Out[45]:
funded_amount          float64
loan_amount            float64
activity                object
sector                  object
use                     object
country_code            object
country                 object
region                  object
currency                object
term_in_months         float64
lender_count             int64
repayment_interval    category
borrower_female          int64
borrower_male            int64
dtype: object

Columns use and region¶

The columns 'activity','sector', 'use' seem to be related.

Lets check the text input in these columns.

In [46]:
df_kiva_v1.iloc[:,2:5]
Out[46]:
activity sector use
0 Fruits & Vegetables Food To buy seasonal, fresh fruits to sell.
1 Rickshaw Transportation to repair and maintain the auto rickshaw used ...
2 Transportation Transportation To repair their old cycle-van and buy another ...
3 Embroidery Arts to purchase an embroidery machine and a variet...
4 Milk Sales Food to purchase one buffalo.
... ... ... ...
671179 Livestock Agriculture [True, u'para compara: cemento, arenya y ladri...
671181 Livestock Agriculture Reviewed loan use in english.
671182 Livestock Agriculture Pretend the flagged issue was addressed by KC.
671184 Livestock Agriculture Translated loan use to english.
671188 Livestock Agriculture Reviewed loan use in english.

646617 rows × 3 columns

In [47]:
#df_kiva_v1.groupby(by='activity').agg(count=('activity','count'))
In [48]:
df_kiva_v1['activity'].nunique()
Out[48]:
163
In [49]:
df_kiva_v1['use'].nunique()
Out[49]:
423452
In [50]:
df_kiva_v1.loc[df_kiva.loc[:,'use'].isnull() & 
            df_kiva.loc[:,'region'].isnull() , :]
Out[50]:
funded_amount loan_amount activity sector use country_code country region currency term_in_months lender_count repayment_interval borrower_female borrower_male
140 2975.0 2975.0 Food Production/Sales Food NaN TZ Tanzania NaN TZS 10.0 110 monthly 1 1
145 1200.0 1200.0 Personal Expenses Personal Use NaN PE Peru NaN PEN 20.0 44 monthly 1 1
170 4250.0 4250.0 Catering Food NaN TZ Tanzania NaN TZS 10.0 116 monthly 1 1
412 2350.0 2350.0 Beauty Salon Services NaN TZ Tanzania NaN TZS 10.0 75 monthly 1 1
414 725.0 725.0 Agriculture Agriculture NaN SV El Salvador NaN USD 20.0 19 monthly 1 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
659604 5625.0 10000.0 Weaving Arts NaN BT Bhutan NaN USD 14.0 210 irregular 1 1
660788 1975.0 1975.0 Home Energy Personal Use NaN PS Palestine NaN USD 27.0 39 monthly 1 1
661718 800.0 1600.0 Furniture Making Manufacturing NaN HT Haiti NaN HTG 13.0 27 irregular 1 1
671151 0.0 25.0 Livestock Agriculture NaN KE Kenya NaN KES 13.0 0 monthly 1 1
671174 0.0 25.0 Games Entertainment NaN KE Kenya NaN KES 13.0 0 monthly 1 1

3894 rows × 14 columns

When both use column and region column have nan values there are significant values of funded amount, loan amount for every row , so we cannot just drop these na values.

Idea 1: we can concat column activity and sector and add it to column use. then those missing values will be filled.

Idea 2: we can substitute 'Unknow' instead of missing values.

Idea 3: The column region will be useful to plot on the world map but it has lots of values missing. We can subsitute like column use with unknown.

Idea 4 : The column region can be deleted as a whole and we can just use country to plot on world map.
In [51]:
df_kiva_v1.dtypes
Out[51]:
funded_amount          float64
loan_amount            float64
activity                object
sector                  object
use                     object
country_code            object
country                 object
region                  object
currency                object
term_in_months         float64
lender_count             int64
repayment_interval    category
borrower_female          int64
borrower_male            int64
dtype: object
In [52]:
df_kiva_v1['use'].fillna('Unknown',inplace=True)
In [53]:
df_kiva_v1['region'].fillna('Unknown',inplace=True)
In [54]:
df_kiva_v1.isnull().sum()
Out[54]:
funded_amount         0
loan_amount           0
activity              0
sector                0
use                   0
country_code          0
country               0
region                0
currency              0
term_in_months        0
lender_count          0
repayment_interval    0
borrower_female       0
borrower_male         0
dtype: int64

Column currency, country¶

In [55]:
df_kiva_v1.describe()
Out[55]:
funded_amount loan_amount term_in_months lender_count borrower_female borrower_male
count 646617.000000 646617.000000 646617.000000 646617.000000 646617.000000 646617.000000
mean 808.045242 866.458816 13.719826 21.115360 1.610250 0.419273
std 1145.419041 1214.276106 8.479912 28.840674 3.077908 1.133063
min 0.000000 25.000000 1.000000 0.000000 0.000000 0.000000
25% 275.000000 275.000000 8.000000 7.000000 1.000000 0.000000
50% 475.000000 500.000000 13.000000 13.000000 1.000000 0.000000
75% 925.000000 1000.000000 14.000000 25.000000 1.000000 1.000000
max 100000.000000 100000.000000 158.000000 2986.000000 50.000000 44.000000

The columns funded amount and lender count have 0 as minimum value and loan amount has minimum of 25.

Lowest funded value is zero

In [56]:
df_kiva_v1.loc[(df_kiva_v1.loc[:,'funded_amount']==0) & 
               (df_kiva_v1.loc[:,'lender_count']==0) &
               (df_kiva_v1.loc[:,'loan_amount']==25),:]
Out[56]:
funded_amount loan_amount activity sector use country_code country region currency term_in_months lender_count repayment_interval borrower_female borrower_male
671151 0.0 25.0 Livestock Agriculture Unknown KE Kenya Unknown KES 13.0 0 monthly 1 1
671160 0.0 25.0 Livestock Agriculture [True, u'to start a turducken farm.'] - this l... KE Kenya Unknown KES 13.0 0 monthly 1 0
671167 0.0 25.0 Livestock Agriculture Kiva Coordinator fixed issue loan (no longer v... KE Kenya Unknown KES 13.0 0 monthly 2 0
671168 0.0 25.0 Livestock Agriculture Pretend the issue with loan got addressed by K... KE Kenya Unknown KES 13.0 0 monthly 1 0
671171 0.0 25.0 Games Entertainment Kiva Coordinator replaced loan use. Should see... KE Kenya Unknown KES 13.0 0 monthly 2 0
671174 0.0 25.0 Games Entertainment Unknown KE Kenya Unknown KES 13.0 0 monthly 1 1
671179 0.0 25.0 Livestock Agriculture [True, u'para compara: cemento, arenya y ladri... PY Paraguay Concepción USD 13.0 0 monthly 1 0
671181 0.0 25.0 Livestock Agriculture Reviewed loan use in english. PK Pakistan Attock PKR 13.0 0 monthly 1 0

Highest funded value is 100000 USD which also has highest lender count.

In [57]:
df_kiva_v1.loc[(df_kiva_v1.loc[:,'funded_amount']==100000) ,:]
Out[57]:
funded_amount loan_amount activity sector use country_code country region currency term_in_months lender_count repayment_interval borrower_female borrower_male
70499 100000.0 100000.0 Agriculture Agriculture create more than 300 jobs for women and farmer... HT Haiti Les Cayes USD 75.0 2986 irregular 1 0

Distribution or total funded amount as per country

In [58]:
df_country=df_kiva_v1.groupby(by='country').agg(total=('funded_amount','sum'))
df_country=df_country.sort_values(by='total',ascending=False)
df_country
Out[58]:
total
country
Philippines 53171450.0
Kenya 31805605.0
Peru 30386450.0
Paraguay 29401725.0
El Salvador 23345975.0
... ...
Mauritania 15000.0
Afghanistan 14000.0
Vanuatu 9250.0
Guam 395.0
Virgin Islands 0.0

87 rows × 1 columns

In [59]:
import plotly_express as px

mybar=px.bar(df_country, 
            x=df_country.columns,
            y=df_country.index,
            labels={'x':'Country','y':'Total funded amount'})
mybar.show()
In [60]:
df_currency=df_kiva_v1.groupby(by='currency').agg(total=('funded_amount','sum'))
df_currency
Out[60]:
total
currency
ALL 2440000.0
AMD 6582800.0
AZN 2696025.0
BIF 2558550.0
BOB 16350975.0
... ...
XOF 20000575.0
YER 1784075.0
ZAR 531275.0
ZMW 1074975.0
ZWD 50.0

67 rows × 1 columns

In [61]:
df_kiva_v1['currency'].unique()
Out[61]:
array(['PKR', 'INR', 'KES', 'NIO', 'USD', 'TZS', 'PHP', 'PEN', 'XOF',
       'LRD', 'VND', 'HNL', 'MNT', 'COP', 'GTQ', 'TJS', 'BOB', 'YER',
       'KHR', 'GHS', 'SLL', 'HTG', 'CLP', 'JOD', 'UGX', 'BIF', 'IDR',
       'GEL', 'UAH', 'EUR', 'ALL', 'CRC', 'XAF', 'TRY', 'AZN', 'DOP',
       'BRL', 'MXN', 'KGS', 'AMD', 'PYG', 'LBP', 'WST', 'ILS', 'RWF',
       'ZMW', 'NPR', 'MZN', 'ZAR', 'BZD', 'SRD', 'NGN', 'VUV', 'XCD',
       'MWK', 'LAK', 'MMK', 'ZWD', 'MDL', 'SSP', 'SBD', 'CNY', 'EGP',
       'MGA', 'NAD', 'LSL', 'THB'], dtype=object)
In [62]:
df_kiva_v1['country'].unique()
Out[62]:
array(['Pakistan', 'India', 'Kenya', 'Nicaragua', 'El Salvador',
       'Tanzania', 'Philippines', 'Peru', 'Senegal', 'Cambodia',
       'Liberia', 'Vietnam', 'Iraq', 'Honduras', 'Palestine', 'Mongolia',
       'United States', 'Mali', 'Colombia', 'Tajikistan', 'Guatemala',
       'Ecuador', 'Bolivia', 'Yemen', 'Ghana', 'Sierra Leone', 'Haiti',
       'Chile', 'Jordan', 'Uganda', 'Burundi', 'Burkina Faso',
       'Timor-Leste', 'Indonesia', 'Georgia', 'Ukraine', 'Kosovo',
       'Albania', 'The Democratic Republic of the Congo', 'Costa Rica',
       'Somalia', 'Zimbabwe', 'Cameroon', 'Turkey', 'Azerbaijan',
       'Dominican Republic', 'Brazil', 'Mexico', 'Kyrgyzstan', 'Armenia',
       'Paraguay', 'Lebanon', 'Samoa', 'Israel', 'Rwanda', 'Zambia',
       'Nepal', 'Congo', 'Mozambique', 'South Africa', 'Togo', 'Benin',
       'Belize', 'Suriname', 'Thailand', 'Nigeria', 'Mauritania',
       'Vanuatu', 'Panama', 'Virgin Islands',
       'Saint Vincent and the Grenadines',
       "Lao People's Democratic Republic", 'Malawi', 'Myanmar (Burma)',
       'Moldova', 'South Sudan', 'Solomon Islands', 'China', 'Egypt',
       'Guam', 'Afghanistan', 'Madagascar', 'Namibia', 'Puerto Rico',
       'Lesotho', "Cote D'Ivoire", 'Bhutan'], dtype=object)
In [63]:
unique_cc = df_kiva_v1.groupby('country')['currency'].apply(pd.Series.unique)
unique_cc
Out[63]:
country
Afghanistan            [USD]
Albania           [ALL, USD]
Armenia           [AMD, USD]
Azerbaijan        [AZN, USD]
Belize                 [BZD]
                     ...    
Vietnam                [VND]
Virgin Islands         [USD]
Yemen                  [YER]
Zambia            [ZMW, USD]
Zimbabwe          [USD, ZWD]
Name: currency, Length: 87, dtype: object

There are 87 countries and 67 currencies. USD is world wide accepted currency. Some countries are using both their own currency and USD. The values of funded_amount and loan_amount are in USD so the currency column has no impact on any other columns. As stated in data dictionary ,it represents the currency in which it is recieved by borrowers. Again it has no impact on any columns in data.

Box plot ( First visualization )¶

In [64]:
import matplotlib.pyplot as plt
import seaborn as sns
In [65]:
# select numerical columns
num_cols = ['funded_amount', 'loan_amount', 'term_in_months', 'lender_count','borrower_female','borrower_male']

# create subplots for each column
fig, axes = plt.subplots(nrows=1, ncols=len(num_cols), figsize=(15,5))

# create boxplot for each column in its corresponding subplot
for i, col in enumerate(num_cols):
    sns.boxplot(x=df_kiva_v1[col], ax=axes[i])
    axes[i].set_title(col)

plt.show()
In [66]:
df_kiva_v1.corr()
Out[66]:
funded_amount loan_amount term_in_months lender_count borrower_female borrower_male
funded_amount 1.000000 0.944453 0.154112 0.847721 0.468677 0.196461
loan_amount 0.944453 1.000000 0.191142 0.796713 0.440255 0.207362
term_in_months 0.154112 0.191142 1.000000 0.233640 -0.153909 -0.027537
lender_count 0.847721 0.796713 0.233640 1.000000 0.262901 0.170445
borrower_female 0.468677 0.440255 -0.153909 0.262901 1.000000 0.172336
borrower_male 0.196461 0.207362 -0.027537 0.170445 0.172336 1.000000

The Range of data in all the columns varies widely.

We can still see a co relation that as the loan amount or funded amount has increased the lenders has also increased.

This data can be better analysed by grouping it with respect to country or sector.

We can gain insights from this data as:

-Which country has highest funded amount?
-which sector has highest funds?
-what is the percentage of female and male borrowers?
-which sector is falling short of funds?
-which sectors are the lenders willing to contribute?
-which country are the lenders willing to contribute?
-what is the average amount lended per person?

We can visualize the data with respect to following factors:

-Distribution of funds countrywise and sector wise.
-Distribution of funds geographically plotted on a map.
-Distribution of fund in males/females.

Extreme Values¶

It is in the interest of our analysis to keep the extreme values.

There are funded_amount as high as 100000 USD and as low as Zero. We can thus observe that the Kiva platform is catering to a large range of buisnesses. It thus creates impact at the highest economic levels to the most remote parts of the society.

The extreme values appear to be outliers in the statistical sense but when we analyse the same data with respect to country or sector in which the investments happen, we can analyse what kind of buisnesses are success or failures, we can thus look at a larger picture.

Extracting information from one column¶

Refering to the section 1.4.2 Column borrower_genders, it was used to extract information of no of males and no of females borrowing money for a project.

This data extraction reduces the memory usuage for the data frame, and also helps to gain insights from the data.

Combining raw data to find KPI¶

We can combine some values to form KPI

-payback_amount
-deficit_amount_per
-avg_lending_amount

  1. we have data in column funded_amount and term_in_month, from this we can derive the payback_amount as

      payback_amount= funded_amount/term_in_month
      the minimum value in term_in_month is 1 , division by zero error does not occur.
    
      we can compare this value with the repayment_interval column to find out how repayment happens.
      The borrowers can accordingly plan the repayment of loan. 
  2. The difference between the loan amount and funded amount gives us the deficit amount

     deficit_amount=loan_amount -funded_amount
    
     We can give the deficit amount for each project and this helps to further analyse which projects did not recieve funding and what are the reasons.
  3. The funded_amount is collective sum of many lenders. The average amount lended per person can help us analyse the lending power by individuals.

     avg_lending_amount= loan_amount/lender_count 
    
     lender_count has minimum value zero , that can be substitued by the minimum loan amount
    
     This feature can help analyse lending power of people.

Feature 1: payback_amount¶

In [67]:
repayment_amount=round((df_kiva_v1['funded_amount']/df_kiva_v1['term_in_months']),2)
repayment_amount
Out[67]:
0         25.00
1         52.27
2          3.49
3         18.18
4         28.57
          ...  
671179     0.00
671181     0.00
671182     0.00
671184     0.00
671188     0.00
Length: 646617, dtype: float64
In [68]:
df_kiva_v1.loc[:,'repayment_amount']=repayment_amount
df_kiva_v1
Out[68]:
funded_amount loan_amount activity sector use country_code country region currency term_in_months lender_count repayment_interval borrower_female borrower_male repayment_amount
0 300.0 300.0 Fruits & Vegetables Food To buy seasonal, fresh fruits to sell. PK Pakistan Lahore PKR 12.0 12 irregular 1 0 25.00
1 575.0 575.0 Rickshaw Transportation to repair and maintain the auto rickshaw used ... PK Pakistan Lahore PKR 11.0 14 irregular 2 0 52.27
2 150.0 150.0 Transportation Transportation To repair their old cycle-van and buy another ... IN India Maynaguri INR 43.0 6 bullet 1 0 3.49
3 200.0 200.0 Embroidery Arts to purchase an embroidery machine and a variet... PK Pakistan Lahore PKR 11.0 8 irregular 1 0 18.18
4 400.0 400.0 Milk Sales Food to purchase one buffalo. PK Pakistan Abdul Hakeem PKR 14.0 16 monthly 1 0 28.57
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
671179 0.0 25.0 Livestock Agriculture [True, u'para compara: cemento, arenya y ladri... PY Paraguay Concepción USD 13.0 0 monthly 1 0 0.00
671181 0.0 25.0 Livestock Agriculture Reviewed loan use in english. PK Pakistan Attock PKR 13.0 0 monthly 1 0 0.00
671182 0.0 125.0 Livestock Agriculture Pretend the flagged issue was addressed by KC. MX Mexico Iztacalco MXN 13.0 0 monthly 2 0 0.00
671184 0.0 875.0 Livestock Agriculture Translated loan use to english. BO Bolivia La Paz BOB 13.0 0 monthly 2 0 0.00
671188 0.0 250.0 Livestock Agriculture Reviewed loan use in english. GH Ghana Dansoman GHS 13.0 0 monthly 1 0 0.00

646617 rows × 15 columns

Feature 2: deficit_amount_per¶

In [69]:
deficit_amount = df_kiva_v1['loan_amount'] -df_kiva_v1['funded_amount']
        
deficit_amount.sort_values()
type(deficit_amount)
Out[69]:
pandas.core.series.Series
In [70]:
df_kiva_v1.loc[df_kiva_v1.loc[:,'funded_amount']> df_kiva_v1.loc[:,'loan_amount'],:]
Out[70]:
funded_amount loan_amount activity sector use country_code country region currency term_in_months lender_count repayment_interval borrower_female borrower_male repayment_amount
277188 425.0 400.0 General Store Retail to buy beverages, rice, laundry detergent, sug... MZ Mozambique Boane, Maputo MZN 17.0 11 monthly 0 1 25.00
338159 3400.0 3000.0 Farm Supplies Agriculture to pay for wires for the grape orchard, cover ... AM Armenia Hoktember village, Armavir region USD 38.0 84 monthly 0 1 89.47

There are two cases where more funds were raised than the loan amount.

lets substitue this value as zero

In [71]:
df_kiva_v1.loc[:,'deficit_amount'] = deficit_amount
df_kiva_v1
Out[71]:
funded_amount loan_amount activity sector use country_code country region currency term_in_months lender_count repayment_interval borrower_female borrower_male repayment_amount deficit_amount
0 300.0 300.0 Fruits & Vegetables Food To buy seasonal, fresh fruits to sell. PK Pakistan Lahore PKR 12.0 12 irregular 1 0 25.00 0.0
1 575.0 575.0 Rickshaw Transportation to repair and maintain the auto rickshaw used ... PK Pakistan Lahore PKR 11.0 14 irregular 2 0 52.27 0.0
2 150.0 150.0 Transportation Transportation To repair their old cycle-van and buy another ... IN India Maynaguri INR 43.0 6 bullet 1 0 3.49 0.0
3 200.0 200.0 Embroidery Arts to purchase an embroidery machine and a variet... PK Pakistan Lahore PKR 11.0 8 irregular 1 0 18.18 0.0
4 400.0 400.0 Milk Sales Food to purchase one buffalo. PK Pakistan Abdul Hakeem PKR 14.0 16 monthly 1 0 28.57 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
671179 0.0 25.0 Livestock Agriculture [True, u'para compara: cemento, arenya y ladri... PY Paraguay Concepción USD 13.0 0 monthly 1 0 0.00 25.0
671181 0.0 25.0 Livestock Agriculture Reviewed loan use in english. PK Pakistan Attock PKR 13.0 0 monthly 1 0 0.00 25.0
671182 0.0 125.0 Livestock Agriculture Pretend the flagged issue was addressed by KC. MX Mexico Iztacalco MXN 13.0 0 monthly 2 0 0.00 125.0
671184 0.0 875.0 Livestock Agriculture Translated loan use to english. BO Bolivia La Paz BOB 13.0 0 monthly 2 0 0.00 875.0
671188 0.0 250.0 Livestock Agriculture Reviewed loan use in english. GH Ghana Dansoman GHS 13.0 0 monthly 1 0 0.00 250.0

646617 rows × 16 columns

In [72]:
## substituting the negative values with zero .

df_kiva_v1.loc[(df_kiva_v1.loc[:,'deficit_amount'] < 0),'deficit_amount']
df_kiva_v1.loc[(df_kiva_v1['deficit_amount'] < 0),'deficit_amount']= 0
df_kiva_v1.loc[:,'deficit_amount'].min()
Out[72]:
0.0

Feature 3: avg_lending_amount¶

In [73]:
df_kiva_v1.loc[:,'loan_amount'].min()
Out[73]:
25.0

We are checking when lender_count is zero than the funded amount should be zero.

In [74]:
df_kiva_v1.loc[(df_kiva_v1.loc[:,'lender_count']==0) &
               (df_kiva_v1.loc[:,'funded_amount']==0),:]
Out[74]:
funded_amount loan_amount activity sector use country_code country region currency term_in_months lender_count repayment_interval borrower_female borrower_male repayment_amount deficit_amount
4814 0.0 5000.0 Food Production/Sales Food grow and create new jobs in my company. US United States Unknown USD 24.0 0 bullet 1 0 0.0 5000.0
6784 0.0 10000.0 Cosmetics Sales Retail start a new stream of income by offering retai... US United States Unknown USD 36.0 0 bullet 0 1 0.0 10000.0
10735 0.0 5000.0 Services Services purchase new equipment and market to more pote... US United States Unknown USD 24.0 0 bullet 0 1 0.0 5000.0
12037 0.0 5000.0 Food Production/Sales Food purchase much needed packaging that is sustain... US United States Unknown USD 24.0 0 bullet 0 1 0.0 5000.0
13542 0.0 1500.0 Crafts Arts increase my inventory of handmade candles, dri... US United States Unknown USD 12.0 0 bullet 0 1 0.0 1500.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
671179 0.0 25.0 Livestock Agriculture [True, u'para compara: cemento, arenya y ladri... PY Paraguay Concepción USD 13.0 0 monthly 1 0 0.0 25.0
671181 0.0 25.0 Livestock Agriculture Reviewed loan use in english. PK Pakistan Attock PKR 13.0 0 monthly 1 0 0.0 25.0
671182 0.0 125.0 Livestock Agriculture Pretend the flagged issue was addressed by KC. MX Mexico Iztacalco MXN 13.0 0 monthly 2 0 0.0 125.0
671184 0.0 875.0 Livestock Agriculture Translated loan use to english. BO Bolivia La Paz BOB 13.0 0 monthly 2 0 0.0 875.0
671188 0.0 250.0 Livestock Agriculture Reviewed loan use in english. GH Ghana Dansoman GHS 13.0 0 monthly 1 0 0.0 250.0

3278 rows × 16 columns

When lender count is zero the funded amount is zero, that means the projects are not funded.

In [75]:
try:
    avg_lending_amount = round((df_kiva_v1.loc[:,'funded_amount']/df_kiva_v1.loc[:,'lender_count']),2)

except ZeroDivisionError:
    avg_lending_amount = 0
    
avg_lending_amount
Out[75]:
0         25.00
1         41.07
2         25.00
3         25.00
4         25.00
          ...  
671179      NaN
671181      NaN
671182      NaN
671184      NaN
671188      NaN
Length: 646617, dtype: float64
In [76]:
df_kiva_v1.loc[:,'avg_lending_amount'] = avg_lending_amount
df_kiva_v1
Out[76]:
funded_amount loan_amount activity sector use country_code country region currency term_in_months lender_count repayment_interval borrower_female borrower_male repayment_amount deficit_amount avg_lending_amount
0 300.0 300.0 Fruits & Vegetables Food To buy seasonal, fresh fruits to sell. PK Pakistan Lahore PKR 12.0 12 irregular 1 0 25.00 0.0 25.00
1 575.0 575.0 Rickshaw Transportation to repair and maintain the auto rickshaw used ... PK Pakistan Lahore PKR 11.0 14 irregular 2 0 52.27 0.0 41.07
2 150.0 150.0 Transportation Transportation To repair their old cycle-van and buy another ... IN India Maynaguri INR 43.0 6 bullet 1 0 3.49 0.0 25.00
3 200.0 200.0 Embroidery Arts to purchase an embroidery machine and a variet... PK Pakistan Lahore PKR 11.0 8 irregular 1 0 18.18 0.0 25.00
4 400.0 400.0 Milk Sales Food to purchase one buffalo. PK Pakistan Abdul Hakeem PKR 14.0 16 monthly 1 0 28.57 0.0 25.00
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
671179 0.0 25.0 Livestock Agriculture [True, u'para compara: cemento, arenya y ladri... PY Paraguay Concepción USD 13.0 0 monthly 1 0 0.00 25.0 NaN
671181 0.0 25.0 Livestock Agriculture Reviewed loan use in english. PK Pakistan Attock PKR 13.0 0 monthly 1 0 0.00 25.0 NaN
671182 0.0 125.0 Livestock Agriculture Pretend the flagged issue was addressed by KC. MX Mexico Iztacalco MXN 13.0 0 monthly 2 0 0.00 125.0 NaN
671184 0.0 875.0 Livestock Agriculture Translated loan use to english. BO Bolivia La Paz BOB 13.0 0 monthly 2 0 0.00 875.0 NaN
671188 0.0 250.0 Livestock Agriculture Reviewed loan use in english. GH Ghana Dansoman GHS 13.0 0 monthly 1 0 0.00 250.0 NaN

646617 rows × 17 columns

In [77]:
df_kiva_v1.isnull().sum()
Out[77]:
funded_amount            0
loan_amount              0
activity                 0
sector                   0
use                      0
country_code             0
country                  0
region                   0
currency                 0
term_in_months           0
lender_count             0
repayment_interval       0
borrower_female          0
borrower_male            0
repayment_amount         0
deficit_amount           0
avg_lending_amount    3278
dtype: int64

The division by zero values are substituted as Nan values, we can substitute that with zero.

In [78]:
df_kiva_v1.loc[:,'avg_lending_amount'].fillna(0, inplace=True)

df_kiva_v1
Out[78]:
funded_amount loan_amount activity sector use country_code country region currency term_in_months lender_count repayment_interval borrower_female borrower_male repayment_amount deficit_amount avg_lending_amount
0 300.0 300.0 Fruits & Vegetables Food To buy seasonal, fresh fruits to sell. PK Pakistan Lahore PKR 12.0 12 irregular 1 0 25.00 0.0 25.00
1 575.0 575.0 Rickshaw Transportation to repair and maintain the auto rickshaw used ... PK Pakistan Lahore PKR 11.0 14 irregular 2 0 52.27 0.0 41.07
2 150.0 150.0 Transportation Transportation To repair their old cycle-van and buy another ... IN India Maynaguri INR 43.0 6 bullet 1 0 3.49 0.0 25.00
3 200.0 200.0 Embroidery Arts to purchase an embroidery machine and a variet... PK Pakistan Lahore PKR 11.0 8 irregular 1 0 18.18 0.0 25.00
4 400.0 400.0 Milk Sales Food to purchase one buffalo. PK Pakistan Abdul Hakeem PKR 14.0 16 monthly 1 0 28.57 0.0 25.00
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
671179 0.0 25.0 Livestock Agriculture [True, u'para compara: cemento, arenya y ladri... PY Paraguay Concepción USD 13.0 0 monthly 1 0 0.00 25.0 0.00
671181 0.0 25.0 Livestock Agriculture Reviewed loan use in english. PK Pakistan Attock PKR 13.0 0 monthly 1 0 0.00 25.0 0.00
671182 0.0 125.0 Livestock Agriculture Pretend the flagged issue was addressed by KC. MX Mexico Iztacalco MXN 13.0 0 monthly 2 0 0.00 125.0 0.00
671184 0.0 875.0 Livestock Agriculture Translated loan use to english. BO Bolivia La Paz BOB 13.0 0 monthly 2 0 0.00 875.0 0.00
671188 0.0 250.0 Livestock Agriculture Reviewed loan use in english. GH Ghana Dansoman GHS 13.0 0 monthly 1 0 0.00 250.0 0.00

646617 rows × 17 columns

Feature 4: Project category¶

In [79]:
df_kiva_v1.describe()
Out[79]:
funded_amount loan_amount term_in_months lender_count borrower_female borrower_male repayment_amount deficit_amount avg_lending_amount
count 646617.000000 646617.000000 646617.000000 646617.000000 646617.000000 646617.000000 646617.000000 646617.000000 646617.000000
mean 808.045242 866.458816 13.719826 21.115360 1.610250 0.419273 72.433627 58.414231 61.655100
std 1145.419041 1214.276106 8.479912 28.840674 3.077908 1.133063 127.806460 399.069402 181.508479
min 0.000000 25.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 275.000000 275.000000 8.000000 7.000000 1.000000 0.000000 23.750000 0.000000 26.830000
50% 475.000000 500.000000 13.000000 13.000000 1.000000 0.000000 38.040000 0.000000 31.580000
75% 925.000000 1000.000000 14.000000 25.000000 1.000000 1.000000 66.670000 0.000000 43.270000
max 100000.000000 100000.000000 158.000000 2986.000000 50.000000 44.000000 10000.000000 50000.000000 9475.000000
In [80]:
min_value= df_kiva_v1.loc[:,"loan_amount"].min()
min_value

max_value= df_kiva_v1.loc[:,'loan_amount'].max()
max_value
Out[80]:
100000.0

The range of loan_amount varies from 25 USD to 100k USD , we can categorize this value into bins as high funds project to low funds projects. countries undertaking big projects can then be identified.

In [81]:
## categorization of loan_amount

## based on the quantile values in the descriptive information lets define categories 

categories = ['Bronze','Silver','Gold','Platinum']

min_value= df_kiva_v1.loc[:,"loan_amount"].min()
max_value= df_kiva_v1.loc[:,'loan_amount'].max()

loan_range = [min_value - .001 * abs(min_value), 250, 500,1000, max_value + .001 * abs(max_value)]


df_kiva_v1["loan_category"] = pd.cut(x=df_kiva_v1.loc[:,"loan_amount"],
                                bins=loan_range,
                                labels=categories)

df_kiva_v1
Out[81]:
funded_amount loan_amount activity sector use country_code country region currency term_in_months lender_count repayment_interval borrower_female borrower_male repayment_amount deficit_amount avg_lending_amount loan_category
0 300.0 300.0 Fruits & Vegetables Food To buy seasonal, fresh fruits to sell. PK Pakistan Lahore PKR 12.0 12 irregular 1 0 25.00 0.0 25.00 Silver
1 575.0 575.0 Rickshaw Transportation to repair and maintain the auto rickshaw used ... PK Pakistan Lahore PKR 11.0 14 irregular 2 0 52.27 0.0 41.07 Gold
2 150.0 150.0 Transportation Transportation To repair their old cycle-van and buy another ... IN India Maynaguri INR 43.0 6 bullet 1 0 3.49 0.0 25.00 Bronze
3 200.0 200.0 Embroidery Arts to purchase an embroidery machine and a variet... PK Pakistan Lahore PKR 11.0 8 irregular 1 0 18.18 0.0 25.00 Bronze
4 400.0 400.0 Milk Sales Food to purchase one buffalo. PK Pakistan Abdul Hakeem PKR 14.0 16 monthly 1 0 28.57 0.0 25.00 Silver
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
671179 0.0 25.0 Livestock Agriculture [True, u'para compara: cemento, arenya y ladri... PY Paraguay Concepción USD 13.0 0 monthly 1 0 0.00 25.0 0.00 Bronze
671181 0.0 25.0 Livestock Agriculture Reviewed loan use in english. PK Pakistan Attock PKR 13.0 0 monthly 1 0 0.00 25.0 0.00 Bronze
671182 0.0 125.0 Livestock Agriculture Pretend the flagged issue was addressed by KC. MX Mexico Iztacalco MXN 13.0 0 monthly 2 0 0.00 125.0 0.00 Bronze
671184 0.0 875.0 Livestock Agriculture Translated loan use to english. BO Bolivia La Paz BOB 13.0 0 monthly 2 0 0.00 875.0 0.00 Gold
671188 0.0 250.0 Livestock Agriculture Reviewed loan use in english. GH Ghana Dansoman GHS 13.0 0 monthly 1 0 0.00 250.0 0.00 Bronze

646617 rows × 18 columns

In [82]:
df_kiva_v1.loc[:,'loan_category'] = df_kiva_v1.loc[:,'loan_category'].astype('category')

Memory usuage and optimization¶

In [83]:
df_kiva_v1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 646617 entries, 0 to 671188
Data columns (total 18 columns):
 #   Column              Non-Null Count   Dtype   
---  ------              --------------   -----   
 0   funded_amount       646617 non-null  float64 
 1   loan_amount         646617 non-null  float64 
 2   activity            646617 non-null  object  
 3   sector              646617 non-null  object  
 4   use                 646617 non-null  object  
 5   country_code        646617 non-null  object  
 6   country             646617 non-null  object  
 7   region              646617 non-null  object  
 8   currency            646617 non-null  object  
 9   term_in_months      646617 non-null  float64 
 10  lender_count        646617 non-null  int64   
 11  repayment_interval  646617 non-null  category
 12  borrower_female     646617 non-null  int64   
 13  borrower_male       646617 non-null  int64   
 14  repayment_amount    646617 non-null  float64 
 15  deficit_amount      646617 non-null  float64 
 16  avg_lending_amount  646617 non-null  float64 
 17  loan_category       646617 non-null  category
dtypes: category(2), float64(6), int64(3), object(7)
memory usage: 85.1+ MB
In [84]:
df_kiva_v1.loc[:,'sector'] = df_kiva_v1.loc[:,'sector'].astype('category')
In [85]:
df_kiva_v1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 646617 entries, 0 to 671188
Data columns (total 18 columns):
 #   Column              Non-Null Count   Dtype   
---  ------              --------------   -----   
 0   funded_amount       646617 non-null  float64 
 1   loan_amount         646617 non-null  float64 
 2   activity            646617 non-null  object  
 3   sector              646617 non-null  category
 4   use                 646617 non-null  object  
 5   country_code        646617 non-null  object  
 6   country             646617 non-null  object  
 7   region              646617 non-null  object  
 8   currency            646617 non-null  object  
 9   term_in_months      646617 non-null  float64 
 10  lender_count        646617 non-null  int64   
 11  repayment_interval  646617 non-null  category
 12  borrower_female     646617 non-null  int64   
 13  borrower_male       646617 non-null  int64   
 14  repayment_amount    646617 non-null  float64 
 15  deficit_amount      646617 non-null  float64 
 16  avg_lending_amount  646617 non-null  float64 
 17  loan_category       646617 non-null  category
dtypes: category(3), float64(6), int64(3), object(6)
memory usage: 80.8+ MB
In [86]:
## creating an optimal copy of dataframe

df_kiva_v1_opti=df_kiva_v1.copy() 
In [87]:
##Lets check memory usage of data frame

df_kiva_v1_opti.memory_usage(deep=True)
Out[87]:
Index                  5172936
funded_amount          5172936
loan_amount            5172936
activity              44762744
sector                  648164
use                   76085559
country_code          38150403
country               42321602
region                46198305
currency              38797020
term_in_months         5172936
lender_count           5172936
repayment_interval      647045
borrower_female        5172936
borrower_male          5172936
repayment_amount       5172936
deficit_amount         5172936
avg_lending_amount     5172936
loan_category           647041
dtype: int64

The columns 'use' has text input, we can analyse this data in a different method , for the current analysis we do not need this column. we can drop this column to make the data frame optimal.

Similarly the value of currency is not used in the analysis , we can delete this column.

The column region has 51k+ unknown values, we can use this in plotting in graph on world map.

The column country code can be used instead of country in graphs where it is difficult to show the name in full.

The column acitivity can be used in detail analysis of which sectors and their subsets are recieving more funding

In [88]:
##Lets drop the unused column

df_kiva_v1_opti = df_kiva_v1_opti.drop(columns=['use','currency'])
df_kiva_v1_opti.memory_usage(deep=True)
Out[88]:
Index                  5172936
funded_amount          5172936
loan_amount            5172936
activity              44762744
sector                  648164
country_code          38150403
country               42321602
region                46198305
term_in_months         5172936
lender_count           5172936
repayment_interval      647045
borrower_female        5172936
borrower_male          5172936
repayment_amount       5172936
deficit_amount         5172936
avg_lending_amount     5172936
loan_category           647041
dtype: int64
In [89]:
df_kiva_v1_opti.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 646617 entries, 0 to 671188
Data columns (total 16 columns):
 #   Column              Non-Null Count   Dtype   
---  ------              --------------   -----   
 0   funded_amount       646617 non-null  float64 
 1   loan_amount         646617 non-null  float64 
 2   activity            646617 non-null  object  
 3   sector              646617 non-null  category
 4   country_code        646617 non-null  object  
 5   country             646617 non-null  object  
 6   region              646617 non-null  object  
 7   term_in_months      646617 non-null  float64 
 8   lender_count        646617 non-null  int64   
 9   repayment_interval  646617 non-null  category
 10  borrower_female     646617 non-null  int64   
 11  borrower_male       646617 non-null  int64   
 12  repayment_amount    646617 non-null  float64 
 13  deficit_amount      646617 non-null  float64 
 14  avg_lending_amount  646617 non-null  float64 
 15  loan_category       646617 non-null  category
dtypes: category(3), float64(6), int64(3), object(4)
memory usage: 70.9+ MB
In [90]:
df_kiva_v1.describe()
Out[90]:
funded_amount loan_amount term_in_months lender_count borrower_female borrower_male repayment_amount deficit_amount avg_lending_amount
count 646617.000000 646617.000000 646617.000000 646617.000000 646617.000000 646617.000000 646617.000000 646617.000000 646617.000000
mean 808.045242 866.458816 13.719826 21.115360 1.610250 0.419273 72.433627 58.414231 61.655100
std 1145.419041 1214.276106 8.479912 28.840674 3.077908 1.133063 127.806460 399.069402 181.508479
min 0.000000 25.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 275.000000 275.000000 8.000000 7.000000 1.000000 0.000000 23.750000 0.000000 26.830000
50% 475.000000 500.000000 13.000000 13.000000 1.000000 0.000000 38.040000 0.000000 31.580000
75% 925.000000 1000.000000 14.000000 25.000000 1.000000 1.000000 66.670000 0.000000 43.270000
max 100000.000000 100000.000000 158.000000 2986.000000 50.000000 44.000000 10000.000000 50000.000000 9475.000000
In [91]:
### save the optimized data frame to a csv file.

df_kiva_v1_opti.to_csv('kiva_v1_opti.csv',sep=',')
In [92]:
df_kiva_v1_opti.to_pickle('kiva_v1_after_preprocessing.pkl')
In [ ]: